0. Import packages

In Python, most things can be done with "import amazing_package_already_done_by_someone_else as solution_to_all_my_problems" Packages can be installed with pip, or with the conda environment manager (recommended).

In [2]:
import pandas as pd #pandas gives us access to the dataframe object
import numpy as np #numpy is the basic mathematical package for vector operations
import seaborn as sns #seaborn gives us a convenient plotting API
import plotly.express as px #plotly gives us an easy way to create interactive graphs

1. Read Data

There are many ways to read data for use in Python. The simplest is to use Pandas to read a csv file. By default we use comma deliminated, utf-8 encoding. Files generated on windows with Excel can have different encoding formats depending on the version, which might require a little of plyaing with the parameters to find the right setting...

In [3]:
raw_df = pd.read_csv("winemag-data_first150k.csv") #the file is saved in the same folder as the notebook, so no need to give a path
In [4]:
type(raw_df) #since we opened the data with Pandas, it's already a DataFrame object
Out[4]:
pandas.core.frame.DataFrame

2. Data Analysis

Let's see what data we have inside the dataframe

In [5]:
raw_df.head() 
Out[5]:
Unnamed: 0 country description designation points price province region_1 region_2 variety winery
0 0 US This tremendous 100% varietal wine hails from ... Martha's Vineyard 96 235.0 California Napa Valley Napa Cabernet Sauvignon Heitz
1 1 Spain Ripe aromas of fig, blackberry and cassis are ... Carodorum Selección Especial Reserva 96 110.0 Northern Spain Toro NaN Tinta de Toro Bodega Carmen Rodríguez
2 2 US Mac Watson honors the memory of a wine once ma... Special Selected Late Harvest 96 90.0 California Knights Valley Sonoma Sauvignon Blanc Macauley
3 3 US This spent 20 months in 30% new French oak, an... Reserve 96 65.0 Oregon Willamette Valley Willamette Valley Pinot Noir Ponzi
4 4 France This is the top wine from La Bégude, named aft... La Brûlade 95 66.0 Provence Bandol NaN Provence red blend Domaine de la Bégude

head() is a function. I can get help on any function with help(). Having pandas' documentation open is a tab is also a good idea: https://pandas.pydata.org/docs/

In [6]:
help(pd.DataFrame.head)
Help on function head in module pandas.core.generic:

head(self: ~FrameOrSeries, n: int = 5) -> ~FrameOrSeries
    Return the first `n` rows.
    
    This function returns the first `n` rows for the object based
    on position. It is useful for quickly testing if your object
    has the right type of data in it.
    
    For negative values of `n`, this function returns all rows except
    the last `n` rows, equivalent to ``df[:-n]``.
    
    Parameters
    ----------
    n : int, default 5
        Number of rows to select.
    
    Returns
    -------
    same type as caller
        The first `n` rows of the caller object.
    
    See Also
    --------
    DataFrame.tail: Returns the last `n` rows.
    
    Examples
    --------
    >>> df = pd.DataFrame({'animal': ['alligator', 'bee', 'falcon', 'lion',
    ...                    'monkey', 'parrot', 'shark', 'whale', 'zebra']})
    >>> df
          animal
    0  alligator
    1        bee
    2     falcon
    3       lion
    4     monkey
    5     parrot
    6      shark
    7      whale
    8      zebra
    
    Viewing the first 5 lines
    
    >>> df.head()
          animal
    0  alligator
    1        bee
    2     falcon
    3       lion
    4     monkey
    
    Viewing the first `n` lines (three in this case)
    
    >>> df.head(3)
          animal
    0  alligator
    1        bee
    2     falcon
    
    For negative values of `n`
    
    >>> df.head(-3)
          animal
    0  alligator
    1        bee
    2     falcon
    3       lion
    4     monkey
    5     parrot

It seems we have more than one country. Let's check what is there

In [7]:
raw_df["country"].unique()
Out[7]:
array(['US', 'Spain', 'France', 'Italy', 'New Zealand', 'Bulgaria',
       'Argentina', 'Australia', 'Portugal', 'Israel', 'South Africa',
       'Greece', 'Chile', 'Morocco', 'Romania', 'Germany', 'Canada',
       'Moldova', 'Hungary', 'Austria', 'Croatia', 'Slovenia', nan,
       'India', 'Turkey', 'Macedonia', 'Lebanon', 'Serbia', 'Uruguay',
       'Switzerland', 'Albania', 'Bosnia and Herzegovina', 'Brazil',
       'Cyprus', 'Lithuania', 'Japan', 'China', 'South Korea', 'Ukraine',
       'England', 'Mexico', 'Georgia', 'Montenegro', 'Luxembourg',
       'Slovakia', 'Czech Republic', 'Egypt', 'Tunisia', 'US-France'],
      dtype=object)

For the moment, I'm only interested in French wines, so let's create a French wine DataFrame

In [8]:
france_df = raw_df[raw_df["country"]=="France"]

How many French wines do we have?

In [9]:
len(france_df)
Out[9]:
21098

Q01. How many wines from "Provence" do we have in the data?

In [10]:
# type your answer in that cell

Our data contains two numerical variables: "points" and "price". Let's draw histograms to vizualize them

In [11]:
france_df[['points',"price"]].hist(bins=20)
Out[11]:
array([[<AxesSubplot:title={'center':'points'}>,
        <AxesSubplot:title={'center':'price'}>]], dtype=object)
2020-09-03T16:57:52.100060 image/svg+xml Matplotlib v3.3.1, https://matplotlib.org/

The points are all between 80 and 100, but we have some crazy high prices in that data. Let's look at the 10 most expensive wines

In [12]:
france_df.sort_values(by='price',ascending=False).head(10)
Out[12]:
Unnamed: 0 country description designation points price province region_1 region_2 variety winery
34920 34920 France A big, powerful wine that sums up the richness... NaN 99 2300.0 Bordeaux Pauillac NaN Bordeaux-style Red Blend Château Latour
34922 34922 France A massive wine for Margaux, packed with tannin... NaN 98 1900.0 Bordeaux Margaux NaN Bordeaux-style Red Blend Château Margaux
83536 83536 France A wine that has created its own universe. It h... Clos du Mesnil 100 1400.0 Champagne Champagne NaN Chardonnay Krug
26296 26296 France A wine that has created its own universe. It h... Clos du Mesnil 100 1400.0 Champagne Champagne NaN Chardonnay Krug
51886 51886 France A wine that has created its own universe. It h... Clos du Mesnil 100 1400.0 Champagne Champagne NaN Chardonnay Krug
34939 34939 France The purest Cabernet Sauvignon fruit, with dark... NaN 96 1300.0 Bordeaux Pauillac NaN Bordeaux-style Red Blend Château Mouton Rothschild
34942 34942 France Solid, very structured, packed with dense and ... NaN 96 1200.0 Bordeaux Pessac-Léognan NaN Bordeaux-style Red Blend Château Haut-Brion
34927 34927 France Such a generous and ripe wine, with a dark cor... NaN 97 1100.0 Bordeaux Pessac-Léognan NaN Bordeaux-style Red Blend Château La Mission Haut-Brion
35531 35531 France This is the first vintage of a wine that used ... NaN 94 1000.0 Bordeaux Pessac-Léognan NaN Bordeaux-style White Blend Château La Mission Haut-Brion
127773 127773 France While there is certainly plenty of wood here, ... NaN 95 850.0 Bordeaux Saint-Émilion NaN Bordeaux-style Red Blend Château Ausone

Q02. Can you find the 10 cheapest wines in the French data?

In [13]:
# type your answer in that cell

Looking at 10 most expensive (and the 10 cheapest), it seems there is a relationship between price and quality (points). Let's check with a simple regression

In [14]:
sns.lmplot(x="price",y="points",data=france_df) #lm means linear model. We could also use relplot or regplot
Out[14]:
<seaborn.axisgrid.FacetGrid at 0x1a0a9c0ca90>
2020-09-03T16:57:54.179073 image/svg+xml Matplotlib v3.3.1, https://matplotlib.org/

That graph doesn't look great. The relationship isn't linear. Maybe we can use the log of the price instead. We calculate log_price variable and then redo the plot with that variable

In [15]:
france_df["log_price"] = np.log(france_df["price"]) #creates a new variable called log_price

sns.lmplot(x="log_price",y="points",data=france_df)
Out[15]:
<seaborn.axisgrid.FacetGrid at 0x1a0a9e00700>
2020-09-03T16:57:57.964060 image/svg+xml Matplotlib v3.3.1, https://matplotlib.org/

All French wines are not the same, let's look at the differences in price and points by region. We can do a pivot table with groupby

In [16]:
france_df.groupby("province").agg({'price':['count','mean'],'points':['mean','max','min']})
Out[16]:
price points
count mean mean max min
province
Alsace 1359 31.876380 88.488690 97 81
Beaujolais 303 17.267327 87.092105 94 80
Bordeaux 2658 42.601956 89.497136 100 80
Burgundy 3342 70.602633 89.489090 98 81
Champagne 1089 93.412305 90.638686 100 81
France Other 243 16.189300 85.688581 92 81
Languedoc-Roussillon 1023 22.015640 86.815157 96 80
Loire Valley 1391 27.071891 88.033595 96 81
Provence 828 23.442029 88.254652 95 81
Rhône Valley 1231 49.832656 88.995448 99 80
Southwest France 1318 22.879363 88.209244 96 80

Q03. Can you create a pivot table with the count of wines by variety, their average, min, max and median prices?

In [17]:
# type your answer in that cell

Pivot tables are fine, but for a nicer looking way of looking at the data, a chart can be a better idea

In [18]:
sns.boxplot(y="province",x="points", data=france_df)
Out[18]:
<AxesSubplot:xlabel='points', ylabel='province'>
2020-09-03T16:58:00.181060 image/svg+xml Matplotlib v3.3.1, https://matplotlib.org/

Beaujolas wines are very cheap, but some of them might be good. What are the best Beaujolais for less than 20$?

In [19]:
france_df[(france_df['province']=="Beaujolais")&(france_df['price']<=20)].sort_values(by="points", ascending=False).head(5)
Out[19]:
Unnamed: 0 country description designation points price province region_1 region_2 variety winery log_price
43999 43999 France A densely tannic wine, showing the influence o... Les Thorins 92 20.0 Beaujolais Moulin-à-Vent NaN Gamay Lucien Lardy 2.995732
121282 121282 France Perfumed, earthy, spicy, this is a balanced wi... Cuvée Zaccharie 92 20.0 Beaujolais Côte de Brouilly NaN Gamay Château Thivin 2.995732
44862 44862 France Of all the Beaujolais crus, Morgon is the one ... NaN 91 15.0 Beaujolais Morgon NaN Gamay Chateau de Pizay 2.708050
44865 44865 France A full-bodied, rich and rounded wine, this has... Cuvée Terre de Lumière 91 20.0 Beaujolais Moulin-à-Vent NaN Gamay Château du Chatelard 2.995732
44872 44872 France Ripe and structured, this has finely balanced ... Les Roches 91 20.0 Beaujolais Fleurie NaN Gamay Lucien Lardy 2.995732

Q04. Can you find the 5 best rosés de Provence for less than 30$?

In [20]:
# type your answer in that cell

So far we've only looked at the French data. Let's look at the relative quality of wines by country

In [21]:
world_df = raw_df

Let's create a relative_value variable so we can compare

In [22]:
world_df["relative_value"] = world_df["points"] / np.log(world_df["price"])
world_df["relative_value"] = world_df["relative_value"] * 100 / np.max(world_df["relative_value"]) # we rescale from 0 to 100 (max)

And now we can compare the relative value by country

In [23]:
sns.barplot(y="country",x="relative_value", data=world_df[world_df['country'].isin(['France',"Italy","Spain","US","Chile","Australia"])])
Out[23]:
<AxesSubplot:xlabel='relative_value', ylabel='country'>
2020-09-03T16:58:02.763078 image/svg+xml Matplotlib v3.3.1, https://matplotlib.org/

Q05. Can you create a relative_value_alt variable that is calculated as points/price and rescaled from 0 to 1000?

In [24]:
# type your answer in that cell

French wine have the worse value? There is a problem with the "points" score in that data set. Let's correct that

I'm going to define a relative scaling factor to adjust the points according to my tastes

In [25]:
scaling_factor = {
    'France':1.1, #let's multiply all French wine scores by 1.1
    'Italy':0.8,
    'Spain':0.7,
    'US':0.9,
    'Australia':0.8,
    'Chile':0.5 #yuck
}

That's a perfect occasion to use merge. Let's transform scaling_factor into a DataFrame and merge it with word_df

In [26]:
scaling_df = pd.DataFrame(
    data=scaling_factor.items(), #items() extracts the key/values pairs from the dictionnary
    columns=["country","scaling_factor"]
)
scaling_df
Out[26]:
country scaling_factor
0 France 1.1
1 Italy 0.8
2 Spain 0.7
3 US 0.9
4 Australia 0.8
5 Chile 0.5

And now we can merge with word_df

In [27]:
country_df = world_df.merge(scaling_df, left_on="country", right_on="country", how="inner") # I'm doing an inner joint => we're dropping the countries for which I didn't define a scaling factor
country_df.head()
Out[27]:
Unnamed: 0 country description designation points price province region_1 region_2 variety winery relative_value scaling_factor
0 0 US This tremendous 100% varietal wine hails from ... Martha's Vineyard 96 235.0 California Napa Valley Napa Cabernet Sauvignon Heitz 28.344485 0.9
1 2 US Mac Watson honors the memory of a wine once ma... Special Selected Late Harvest 96 90.0 California Knights Valley Sonoma Sauvignon Blanc Macauley 34.390152 0.9
2 3 US This spent 20 months in 30% new French oak, an... Reserve 96 65.0 Oregon Willamette Valley Willamette Valley Pinot Noir Ponzi 37.071102 0.9
3 8 US This re-named vineyard was formerly bottled as... Silice 95 65.0 Oregon Chehalem Mountains Willamette Valley Pinot Noir Bergström 36.684945 0.9
4 9 US The producer sources from two blocks of the vi... Gap's Crown Vineyard 95 60.0 California Sonoma Coast Sonoma Pinot Noir Blue Farm 37.402120 0.9

Now let's recalculate the relative value and replot

In [28]:
country_df["julien_value"] = country_df["points"] * country_df["scaling_factor"] / np.log(country_df["price"])
country_df["julien_value"] = country_df["julien_value"] * 100 / np.max(country_df["relative_value"]) # we rescale from 0 to 100 (max)
In [29]:
sns.barplot(y="country",x="julien_value", data=country_df)
Out[29]:
<AxesSubplot:xlabel='julien_value', ylabel='country'>
2020-09-03T16:58:05.443062 image/svg+xml Matplotlib v3.3.1, https://matplotlib.org/

That looks much better!

Q06. Which types of wines do you like the best? Define a scoring scheme for the variety (maybe similar to my scaling_factor) and find out which wines have the best value for you

In [30]:
# type your answer in that cell

To finish, let's do something a little more complex. For the 5 most popular varieties, let's display an interactive graph that shows the price and points of the wines with the highest value

In [39]:
variety_short_list = country_df['variety'].value_counts()[0:4].index # taking the 5 most popular varieties

for variety in variety_short_list:
    temp_df = country_df[country_df['variety']==variety].sort_values(by="julien_value", ascending=False).head(500)
    fig = px.scatter(temp_df, size="julien_value",x="price",y="points",color="province",hover_name="winery",title=variety)
    fig.show(renderer="notebook")

Q07. What else would you like to see from the data? Use px.scatter() to create an interactive graph

In [32]:
# type your answer in that cell
In [38]:
#export to HTML for sharing
!jupyter nbconvert --to html Basic_Python_For_DataScience.ipynb
[NbConvertApp] Converting notebook Basic_Python_For_DataScience.ipynb to html
[NbConvertApp] Writing 4730711 bytes to Basic_Python_For_DataScience.html
In [ ]: